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Abstract 

Vertebrate genome comparisons revealed that there are highly conserved noncoding sequences (HCNSs) among a wide 
range of species and many of which contain regulatory elements. However, recently emerged sequences conserved in 
specific lineages have not been well studied. Toward this end, we identified 8,198 primate and 21,128 specific HCNSs as 
representative ones among mammals from human-marmoset and mouse-rat comparisons, respectively. Derived allele 
frequency analysis of primate-specific HCNSs showed that these HCNSs were under purifying selection, indicating that they 
may harbor important functions. We selected the top 1 ,000 largest HCNSs and compared the lineage-specific HCNS-flanking 
genes (LHF genes) with ultraconserved element (UCE)-flanking genes. Interestingly, the majority of LHF genes were different 
from UCE-flanking genes. This lineage-specific set of LHF genes was more enriched in protein-binding function. Conversely, 
the number of LHF genes that were also shared by UCEs was small but significantly larger than random expectation, and 
many of these genes were involved in anatomical development as transcriptional regulators, suggesting that certain groups 
of genes preferentially recruit new HCNSs in addition to old HCNSs that are conserved among vertebrates. This group of LHF 
genes might be involved in the various levels of lineage-specific evolution among vertebrates, mammals, primates, and 
rodents. If so, the emergence of HCNSs in and around these two groups of LHF genes developed lineage-specific 
characteristics. Our results provide new insight into lineage-specific evolution through interactions between HCNSs and their 
LHF genes. 
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Introduction 

From the inception of molecular evolutionary studies, protein 
noncoding regions were suspected to be involved in gene reg- 
ulation (Zuckerkandl and Pauling 1965; Britten and Davidson 
1 971 ; King and Wilson 1 975). Now it is widely accepted that 
some noncoding regions play important roles in gene regula- 
tion (e.g., Carroll 2005). The functional elements are expected 
to evolve more slowly than surrounding nonfunctional DNA, 
as they are under purifying selection (Kimura 1983; Nei 
1 987). Therefore, sequences that are more highly conserved 
are likely to be important from the functional point of view. 
In fact, 5% of the human genome is conserved (Mouse 
Genome Sequencing Consortium 2002), a considerably 



higher proportion than that (2%) of the protein coding 
regions (International Human Genome Sequencing Consor- 
tium 2004). Many highly conserved noncoding sequences 
(HCNSs) among vertebrates have now been identified 
(Ahituv et al. 2004; Bejerano et al. 2004; Siepel et al. 
2005), and some of which are reported to function as distal 
enhancers for neighboring genes (e.g., Woolfe et al. 2005; 
Pennacchio et al. 2006). 

The regions conserved in only one restricted lineage such 
as primates, and rodents are considered to be recently 
emerged HCNSs. These HCNSs may have gained new func- 
tions to develop the lineage-specific characteristics after di- 
verging from the ancestral species. However, the commonly 
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accepted strategy for detecting regulatory regions is to 
identify HCNSs among a wide range of species such as ver- 
tebrates. This approach only identifies the regions conserved 
among diverged species and does not detect sequences 
conserved in just one particular small lineage. Indeed, such 
comparisons among vertebrate genomes are known to miss 
a large number of highly constrained mammalian-specific 
functional elements despite the fact that these elements 
are all under similarly intense levels of purifying selection 
in mammals (Aparicio et al. 2002; Hillier et al. 2004; Cooper 
et al. 2005). 

Challenges and limitations exist for studies seeking to iden- 
tify the evolution of regulatory regions by detecting changes 
that were accelerated as a result of lineage-specific positive 
selection (Pollard et al. 2006; Prabhakar et al. 2006, 2008). 
These studies focused only on human-specific changes. How- 
ever, positive selection is not the only possible explanation for 
these lineage-specific accelerated sequences. Biased gene 
conversion (BGC) is a neutral mutation process associated 
with meiotic recombination, which favors a special kind of 
mutation pattern (Marais 2003). BGC can create strong sub- 
stitution hotspots, thereby leading to spurious signatures of 
positive selection (Dreszeretal. 2007; Galtierand Duret2007; 
Duret and Galtier 2009; Sumiyama and Saitou 201 1). In ad- 
dition, there are reports that the selective pressure affecting 
the evolution of regulatory elements in the hominid lineage is 
significantly relaxed compared with that of the rodent lineage 
(Kryukov et al. 2005) and that regulatory elements in hom- 
inids may be diverging at a neutral rate (Keightley et al. 2005). 
All these elements point to the difficulty in detecting evidence 
of positive selection in one lineage. 

Another challenge for finding the lineage-specific regu- 
latory regions was to identify HCNSs found only in one lin- 
eage comprised of very closely related species. The primates 
is one of the lineages of closely related species compared 
with the mammals and vertebrates. Sequence comparisons 
only among primates are likely to capture functional com- 
ponents of the lineage due to shared biological processes 
(Boffelli et al. 2003). However, to date, this strategy of com- 
paring genomes among closely related species has been ap- 
plied only to the very limited regions. Furthermore, the goal 
of this method was to identify all sequences conserved 
among species at various levels of divergence, such as ver- 
tebrates, mammals, and primates but not primate-specific 
HCNSs. In contrast to the lineage-specific phenotypic 
changes, the HCNSs which are conserved only in one par- 
ticular lineage have not been well studied. Thus, to expand 
our understanding of lineage-specific evolution, we identi- 
fied HCNSs that were conserved in a particular lineage 
(either in primates or in rodents) and compared character- 
istics of the lineage-specific HCNSs with those conserved 
among mammals and vertebrates. We used human and 
marmoset genomes for detecting primate-specific HCNSs, 
whereas mouse and rat genomes were used for detecting 




Fig. 1. — Phylogenetic relationship of species mainly used in this 
study. The blue, yellow, and purple circles represent primate-specific, 
rodent-specific, and vertebrate-shared HCNSs, respectively. The approx- 
imate divergence times for ancestral species of each lineage are shown 
on the tree (Mouse Genome Sequencing Consortium 2002; Hedges and 
Kumar 2003; Gibbs et al. 2004; She et al. 2006). 

rodent-specific HCNSs (fig. 1). The lineage-specific HCNSs 
identified in this study are expected to provide new insight 
into how one lineage evolved from a common ancestor. 

Materials and Methods 

Genomes Used in This Study 

We used a total of 13 vertebrate genomes with over 6X 
coverage and high quality since alignments containing 
low-coverage genomes cause misalignment. All genomes 
were obtained from the UCSC Genome Bioinformatics data- 
base (http://genome.ucsc.edu/). They are medaka (Oryzias 
latipes; oryl_at2), stickleback (Gasterosteus aculeatus; 
gasAcul), fugu {Takifugu rubripes; fr2), tetradon {Tetraodon 
nigroviridis; tetNig2), zebrafish {Danio rerio; danRer6), frog 
(Xenopus tropicalis; xenTro2), lizard {Anolis carolinensis; ano- 
Car1), chicken (Gallus gallus; galGal3), dog (Canis familiaris; 
canFam2), horse (Equus caballus; equCabl ), cow (Bos taurus; 
bosTau4), rat {Rattus norvegicus; rn4), mouse {Mus musculus; 
mm9), marmoset {Callithrix jacchus; calJac3), and human 
{Homo sapiens; hg19). Genomic alignments between human 
and marmoset and between mouse and rat were also 
retrieved from the UCSC Genome Bioinformatics database. 
Genome sequences of rhesus macaque (Macaca mulatta; 
rheMac2), orangutan (Pongo pygmaeus abelii; ponAbe2), 
and chimpanzee {Pan troglodytes; panTro3) were also used 
for extraction of primate-specific HCNSs. 

Filtering Repeats and Coding Sequences 

Repetitive sequences (chrN_rmsk tables) in the human and 
mouse genomes were obtained from the UCSC database. 
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All repetitive sequences were excluded from the analysis. Fil- 
tering of coding regions was performed based on the anno- 
tation (CCDS. 20080902) of NCBI CCDS project (http:// 
www.ncbi.nlm.nih.gov/projects/CCDS/) (Pruitt et al. 2009) 
and the Ensembl human database (http://www.ensembl. 
org/) (Hubbard et al 2002). 

Extraction of Primate-Shared and Rodent-Shared HCNSs 

We applied a sliding window analysis to UCSC pairwise 
noncoding alignments of human-marmoset and mouse- 
rat (http://genome.ucsc.edu/). We first extracted the 
repeat-masked noncoding regions and performed sliding 
window analysis. The window and step size were set to 
be 100 bp and 25 bp, respectively. When making sliding 
window sequences, we kept only the sequences that 
had no gap in a window. To estimate P values for 
lineage-specific HCNSs, we calculated the divergence of 
the nongapped noncoding regions between human- 
marmoset and mouse-rat pairwise alignments (—10% 
and -14% for autosomes and -9% and -14% for 
chromosome X, respectively). We assumed that these av- 
erage genome divergences are neutral substitution rates 
and obtained statistical significance of the lineage-specific 
HCNSs by using a binomial distribution. 

Identification of Lineage-Specific HCNSs 

Discontiguous MegaBLAST homology search (Zhang et al. 
2000) was performed to extract primate-specific HCNSs 
against the nonprimate vertebrate genomes. Similarly, 
rodent-specific HCNSs were extracted by performing Mega- 
BLAST search against the nonrodent vertebrate genomes. 
Parameters for MegaBLAST were discontiguous word tem- 
plate size 16 bp, word matches 12 bp, and mismatch pen- 
alty -2. Alignable sequences may be homologous regions. 
We therefore removed the MegaBLAST hits with >30% 
identity and >30 bp in length from primate-shared and 
rodent-shared HCNSs since the sequences with >40% iden- 
tity may contain functional elements (McGaughey et al. 
2008). The homologous sequences among mammals 
(e.g., human and dog) with <30 bp length and <30% iden- 
tity can be found throughout the genome and are assumed 
to be neutral when assessing average genome identity. 
However, the homologous sequences among diverged ver- 
tebrates (e.g., human and fish) are considered to be func- 
tional elements. We removed these alignable sequences 
among vertebrate genomes (birds, lizard, frog, and fish) 
from the lineage-specific HCNSs using UCSC multiway 
alignments. In addition, since there is no closely related 
species available for rodent lineage, we applied further 
filtering only for extraction of primate-specific HCNSs and 
removed the HCNSs that were not found or showed low 
identity (<98%) in the rhesus macaque, orangutan, and 
chimpanzee genomes. 



To make analyses of these lineage-specific HCNSs easier, 
we extracted the top 1,000 largest HCNSs, as longer se- 
quences were considered to be under stronger constraint. 
We assumed that the constraints on the HCNSs in the same 
bin (class of length) were equal. HCNSs were chosen from 
the first bin to the nth bin until the total number approached 
1 ,000. Additional HCNSs were chosen by random from the 
(n + 1 )-th bin to reach the total HCNS numbers to be 1 ,000. 

SNP Detection 

We downloaded human single nucleotide polymorphism 
(SNP) data from the Hapmap database (http://hapmap.ncbi. 
nlm.nih.gov/) and mouse SNP data from NCBI dbSNP build 
128 (http://www.ncbi.nlm.nih.gov/SNP). We extracted only 
the SNP with minor allele frequencies of at least 0.01 in one 
of four populations (YRI, CEU, JPT, and CHB) in humans and 
one of all strains in mice. The densities of the SNPs in the 
repeat masked noncoding sequences in the human and 
mouse genomes were used to estimate the expected SNP 
numbers in HCNSs, and these were compared with the ob- 
served SNP numbers of HCNSs using %2-analysis. 

Derived Allele Frequency Estimation 

To estimate derived allele frequency (DAF), we converted 
the coordinates of primate-specific HCNSs into those of 
hg18 to obtain allele frequencies in human populations pro- 
vided by HapMap release 27. We determined the ancestral 
allele by using chimpanzee alleles defined by UCSC 
snpl 260rthoPt2Pa2Rm2. An SNP locus was discarded when- 
ever the allele of its orthologous chimpanzee locus did not 
match either human allele. We used 2x2 contingency tables 
to compare DAF distribution for SNPs within primate-specific 
HCNSs with all nonrepetitive human noncoding genomes. 

Gene Ontology Analysis 

We looked for significantly enriched gene categories in pri- 
mate and rodent LHF genes in the Gene Ontology (GO) da- 
tabase (http://www.geneontology.org/) (Ashburner et al. 
2000). The assignment of GO terms and the test for statis- 
tical enrichment of those terms were performed with GO- 
stat using goa_human and mgi GO gene association 
database (http://gostat.wehi.edu.au/) (Beissbarth and 
Speed 2004). The enrichment of InterPro domains (http:// 
www.ebi.ac.uk/interpro/) of human and mouse genes asso- 
ciated with HCNSs was determined by Fisher's exact test. 
The correction for multiple comparisons was performed 
by using the false discovery rate option in GOstat. 

Analysis of d/V and dS Levels for LHF Genes 

We obtained ortholog lists from Ensembl through biomart 
for human-marmoset, human-rhesus macaque, and 
mouse-rat pairs (Hubbard et al. 2002) and extracted only 
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the LHF genes (and LHF orthologs) that were located within 
1 Mb of HCNSs in all genomes. d/V and d5 values were also 
downloaded from Ensembl (Vilella et al 2008). These values 
were estimated by using codeml in the PAML package 
(model = 0, NSsites = 0) (Yang 1 997). With d/V and d5 val- 
ues of one-to-one pair orthologs in Ensembl homolog lists, 
we calculated the means of d/V and d5 of LHF genes and all 
genes in the human and mouse genomes. Statistical analysis 
(one-sample Mest, two tailed) was conducted using the R 
package (http://www.r-project.org). For ultraconserved ele- 
ment (UCE)-flanking genes, we used genes that were located 
within 1 Mb of UCEs in both human and mouse genomes. 
With these extracted genes, the same procedure was used 
for estimation of d/V and d5 for UCE-flanking genes. 

Expected Number of Genes Shared by Lineage-Specific 
HCNSs and UCEs 

The expected number of overlapping genes among lineage- 
specific HCNSs and UCEs was calculated by random sam- 
pling simulation. This random sampling weights the chance 
of choosing a gene by the length of the chromosome where 
the gene is located. We randomly selected the same number 
of genes as the primate LHF genes from the human genome, 
those as the rodent LHF genes from the mouse genome, and 
those as the noncoding UCE-flanking genes from both hu- 
man and mouse genomes, in each 10,000 replicates. Using 
these data sets, we counted the number of shared genes 
between primate LHF and UCE-flanking genes, rodent 
LHF and UCE-flanking genes, and primate and rodent 
LHF genes and obtained the expected numbers for overlap- 
ping genes. Chi-squared tests were conducted for observed 
and expected numbers using the R package. 

Results 

Determination of Parameters to Extract Lineage-Specific 
HCNSs 

One important parameter when identifying highly con- 
served sequences among closely related species is the win- 
dow size used to compare sequences. Although larger 
windows have more statistical power to detect significantly 
conserved sequences among closely related species, smaller 
windows provide better resolution for the analysis. Thus, it is 
important to set the smallest possible window size that is still 
large enough to detect conserved sequences. Another im- 
portant parameter for identifying conserved sequences 
among closely related species is the threshold for extraction 
of conserved sequences. Particularly for closely related spe- 
cies, this substitution number must not be too small because 
the effect of sequencing errors on the determination of sig- 
nificantly conserved sequences may not be negligible. Tak- 
ing this into consideration, a threshold of 100% identity 
increases the number of false negative identification of 



HCNSs and is thus too strict. For these reasons, the window 
size for sequence comparison among closely related species 
should be determined by considering the substitution num- 
bers within a given window. 

By way of preliminary analysis, we examined which win- 
dow size was most appropriate for identifying HCNSs in 
closely related species by assuming a simple model in which 
the substitution rate within a given window follows a bino- 
mial distribution. The window size setting is a more sensitive 
process in the human and marmoset comparison than that 
in the mouse and rat comparison because the average ge- 
nomic divergence between human and marmoset (nongap- 
ped noncoding region: —10%) is smaller than that between 
mouse and rat (-14%). We thus estimated the number of 
substitutions in HCNSs between human and marmoset. 
First, we defined that the HCNSs reside in the lowest 5% 
of the left tail of the distribution and obtained the expected 
number of substitutions of HCNSs in 50, 1 00, 1 50, and 200 
bp windows using the average genome identity of nongap- 
ped noncoding region as a neutral rate. We then chose to 
use 100 bp since the length was relatively small but the 
range of expected substitution numbers in the HCNS was 
between 0 and 5. This window size also has adequate sta- 
tistical power for rodents whose genetic divergence was 
larger than primates. We therefore used the same window 
length to extract rodent HCNSs for the simplicity of the 
analysis. 

The first step involved extraction of 100 bp primate/ 
rodent shared conserved sequences from human- 
marmoset/mouse-rat pairwise alignments (2 Gbp and 1.8 
Gbp alignments, respectively; fig. 2). The pairwise align- 
ments were obtained from the UCSC Genome Bioinfor- 
matics database. We first limited the region used in 
this analysis to repeat masked noncoding sequences 
(350 Mbp in human-marmoset and 210 Mbp in mouse- 
rat pairwise alignments). By using this repeat masked 
noncoding human-marmoset and mouse-rat pairwise 
alignments, we created 100-bp sliding windows, which 
have no gaps from nonrepetitive alignments, with a step size 
of 25 bp. The total numbers of repeat-masked and non- 
gapped 100-bp sliding windows in human-marmoset and 
mouse-rat were 13,618,548 and 8,187,889, respectively. 

From these windows, we extracted HCNSs using empir- 
ical conservation cutoffs and obtained only the sequences 
which have no substitution (0.78% of the total 1 00-bp frag- 
ments), one substitution or fewer (2.2%), and two substi- 
tutions or fewer (4.3%) in the human and marmoset 
pair, and those with no substitution (0.94%), one or fewer 
(2.5%), and two or fewer (4.5%) in the mouse and rat pair. 
When we extracted sequences that had three or fewer sub- 
stitutions, the percentage of the total fragments exceeded 
5% in both comparisons. Thus, we determined that a sub- 
stitution number of two was the appropriate threshold for 
extraction of HCNSs of primates and rodents (P< 2.1 x 10~ 3 
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Fig. 2. — The procedure of extraction of primate- and rodent- 
specific HCNSs. Pairwise alignments represent human-marmoset and 
mouse-rat alignments for extraction of primate- and rodent-specific 
HCNSs, respectively. After the step "remove vertebrate homologous 
regions," another filtering was applied for primate comparison, and the 
sequences that were not conserved in other primate species (rhesus 
macaque, orangutan, and chimpanzee) were removed. 

and P < 1.5 x 10~ 5 , respectively, the binomial model). 
These numbers determined nucleotide identity thresholds 
for highly conserved regions in primates and rodents to 
be >98% ([1 00 - 2]/1 00). Note that we determined thresh- 
olds specifically for chromosome X because the mutation 
rate on the X chromosome of mammalian genomes is 
known to be lower than that of autosomes (Miyata et al. 
1987; Takahata et al. 1995; Makova and Li 2002). To esti- 
mate the appropriate calibration parameter for chromo- 
some X, we examined the number of substitutions in 
autosome and chromosome X separately in nongapped 
noncoding pairwise alignments of human-marmoset and 
mouse-rat pairs. In the human-marmoset comparison, 
we did not observe any difference in the number of substi- 
tution in the HCNSs between autosome and chromosome X 
and found 2 substitutions in a window (>98% identity, P < 
4.8 x 10~ 3 , the binomial model). However, in the mouse- 
rat comparison, we observed difference between the two 
and found only one substitution in a window (>99% iden- 
tities, P < 1.2 x 10~ 5 , the binomial model). We calibrated 
the threshold for chromosome X only for mouse-rat se- 
quences as 99% ([100 - 1 ]/1 00) and obtained a total of 
590,678 and 356,529 conserved 100-bp sequences from 
the human-marmoset pair and mouse-rat pair, respectively. 
These extracted sequences account for less than 2% of the 
human and mouse genomes. The extraction of conserved 



sequences is, however, only a starting point for the extrac- 
tion of lineage-specific HCNSs, which were filtered further in 
the next step. 

Extraction of Lineage-Specific HCNSs 

The second step is an extraction of lineage-specific HCNSs. 
We performed MegaBLAST search against vertebrate ge- 
nomes with extracted HCNSs (590,678 and 162,304 from 
primates and rodents, respectively). The primate- and 
rodent-specific HCNSs are conserved sequences that have 
emerged after the divergence of these lineages from their an- 
cestors (fig. 1), such that they are found only in primates or 
rodents. We thus removed all HCNS homologous sequences 
that were found in nonprimate and nonrodent vertebrates. 
For extraction of primate-specific HCNSs, an additional filter- 
ing criterion was applied, and we limited to the HCNSs that 
were also conserved in the rhesus macaque, orangutan, and 
chimpanzee genomes. The remainders were 8,198 and 
21,128 for primate- and rodent-specific HCNSs, respectively. 
For simplicity, we used the top 1,000 largest lineage-specific 
HCNSs for both rodents and primates when comparing their 
characteristics. The primate- and rodent-specific HCNSs 
range in size from 1 25 to 375 bp and 1 75 to 425 bp, respec- 
tively. Figure 3A and B show the average numbers of substi- 
tutions per site (approximated with p-distance) in those 1 ,000 
primate- and rodent-specific HCNSs and their ±10,000 bp 
flanking regions, respectively. The patterns indicate that 
only the HCNSs are under the strong constraints, relative 
to their flanking regions. The ±500 bp flanking regions 
shown in the insets showed a smaller number of differences 
compared with those of genome averages. However, the 
number of substitutions of lineage-specific HCNSs is clearly 
much lower even when compared with that of ±500 bp 
flanking regions. 

SNP Analysis 

A SNP is a good indicator for detection of the selective 
constraint on the sequence in question. We investigated 
the number of SNPs overlaid on the lineage-specific HCNSs 
in humans and mice (Sherry et al. 2001 ; HapMap Consortium 
2005) and found that less than 10% of lineage-specific 
HCNSs had SNPs (MAF < 0.01). The majority of lineage- 
specific HCNSs have no SNPs and the numbers of SNPs in both 
primate (SNP density per site: 9.74 x 10~ 4 ) and rodent 
(1.96 x 10~ 4 ) lineages were significantly smaller than 
genome-wide averages (1.6 x 10~ 3 and 5.0 x 10~ 4 for 
human and mouse nonrepetitive genomes, respectively). 
(P << 0.01 for primate- and rodent-specific HCNSs, 
Chi-squared test.) 

To measure the relative level of purifying selection acting 
on HCNSs, we analyzed DAF distributions of primate- 
specific HCNSs and compared these distributions with those 
of the human genome (fig. 4). Purifying selection is likely the 
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Fig. 3. — Substitution rates in lineage-specific intergenic HCNSs and their flanking regions. The average substitution number per site within 1 00-bp 
window in the range of ±10,000 bp of the top largest 1,000 primate-specific HCNSs {A) and rodent-specific HCNSs (£). The insets show enlarged 
distributions in the range of ±1,500 bp. The red lines represent average substitution numbers per site of nongapped noncoding regions in the human 
and mouse genomes, respectively. The error bars are 95% confidence intervals of substitution rate in each window. 



main evolutionary force preventing the vertebrate HCNS 
from accumulating mutations (Katzman et al. 2007). Quan- 
titatively, the signature of the purifying selection can be ob- 
served as a shift in the allele frequency toward ancestral 
alleles (Drake et al. 2006). We observed the levels of DAF 
< 0.1 and 0.2 within primate-specific HCNSs in three human 
populations: Yoruba (YRI), Han Chinese + Japanese (ASN), 
and American of European Ancestor (CEU). At the level of 
DAF < 0.1, only the YRI and ASN populations showed a 



significant excess of rare-derived alleles of SNPs within 
primate-specific HCNSs compared with the genome aver- 
age (P < 0.05, Chi-squared test). However, at the level of 
DAF < 0.2, all populations showed a significant excess of rare 
allele of the SNPs (P < 0.006, Chi-squared test). This is 
consistent with previously published results on non-lineage- 
specific HCNSs (Drake et al. 2006; Ovcharenko 2008, 
Katzman et al 2007), suggesting that purifying selection is 
acting on the primate-specific HCNSs. 
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Fig. 4. — DAF distribution in primate-specific HCNS. DAF distribu- 
tion of Yoruba from Nigeria (YRI) (A), Han Chinese from Beijing 
combined with Japanese from Tokyo (ASN) (B), and American of 
European ancestry (CEU) (0- Light gray and blue bars represent data for 
SNPs in the nonrepetitive human genome and SNPs within primate- 
specific HCNSs. Error bars were estimated using binominal distribution 
as a 2 = (pq)/n, where p represented the fraction of SNPs in a particular 
bin, q represented 1 - p, and n represented the total number of SNPs. 
All primate-specific HCNSs (8,198) were used for this analysis. 

General Features of the Lineage-Specific HCNSs 
Genie Category 

To determine if there are any general trends in the distribu- 
tion of the lineage-specific HCNSs, we compared three an- 
notation categories (intron, intergenic, and UTR) of the 
lineage-specific HCNSs. The fractions of lineage-specific 
HCNSs and their genie categories are shown in figure 5. 
The fractions of HCNSs residing in UTRs and introns in both 




UTRs 1.2% 

51.8% 
Intronic Intergenic 



B UTRs 0.6% 





Fig. 5. — Fractions of genie categories in whole genomes and 
lineage-specific HCNSs. The pie charts show percentages of genie 
categories in the human genome (left) and primate-specific HCNSs 
(right) (A), in the mouse genome (left) and rodent-specific HCNSs (right) 
(B). The percentages of UTRs become markedly elevated in the lineage- 
specific HCNSs. The distribution of genie categories between genomes 
and lineage-specific HCNSs showed significant difference (P < 10~ 15 , 
Chi-squared test). 

primates and rodents were much higher than those of the 
whole human and mouse genomes, respectively (P < 1 0~ 1 5 , 
Chi-squared test). The most striking difference was found in 
the UTR category, where the fractions of UTR in the primate 
and rodent-specific HCNSs were 3 times and 5.5 times high- 
er than those of human and mouse genomes, respectively. 
This increased fraction of UTRs is consistent with the ten- 
dency of HCNSs in vertebrates (e.g., Siepel et al. 2005; 
Woolfe et al. 2005). However, the fractions of UTR category 
in lineage-specific HCNSs were lower than those in non- 
lineage specific vertebrates in UTRs (-6%). In addition, 
the fractions of intergenic + UTR and intronic categories 
differed between primate- and rodent-specific HCNSs 
(P = 1.04 x 10" 4 , Chi-squared test). 

GO Analysis of the Lineage-Specific HCNS-Flanking 
Genes 

The function of genes that are located near lineage-specific 
HCNSs may provide important information for understand- 
ing the lineage-specific evolution. We therefore examined 
the statistically overrepresented functions of the lineage- 
specific HCNS-flanking genes (LHF genes). We first obtained 
the distance between the lineage-specific HCNSs and their 
LHF genes and found that 96.2% and 97.5% of intergenic 
HCNSs are located within 1 Mb of the transcription start site 
of LHF genes in human and mouse, respectively. The longest 
distance between a target developmental gene and its ex- 
perimentally verified enhancer is — 1 Mb; the reported genes 
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are SHH (Lettice et al. 2003; Sagai et al. 2005), SOX9 (Bishop 
et al. 2000), and SHOX (Sabherwal et al. 2007). These find- 
ings indicate that at least some HCNSs may be associated 
with LHF genes as distal regulatory elements. 

Next, we looked for functional categories of LHF genes as 
defined by the GO database (Beissbarth and Speed 2004) 
and obtained significantly enriched functions of LHF genes. 
The top 30 overrepresented gene functions are shown in 
table 1 . In primate and rodent LHF genes, statistically over- 
represented functions were developmental process, protein 
binding, and regulation of transcription. In developmental 
process, anatomical structure (P = 3.1 x 10~ 66 and 
P=3.1 x 10~ 66 for primate and rodent LHF genes, respec- 
tively) and nervous system development (P = 6.1 x 10~ 54 
and P = 1 .9 x 1 0~ 13 ) were enriched. In transcriptional reg- 
ulation, positive regulation of transcription (P = 7.1 x 10~ 23 
for primate LHF genes) and regulation of transcription 
(P = 5.0 x 10~ 7 for rodent LHF genes) were overrepre- 
sented. This tendency of overrepresented gene functions 
is consistent with previous studies of HCNSs among 
vertebrates (e.g., Bejerano et al. 2004; Siepel et al. 2005; 
Woolfe et al. 2005). However, both primate- and rodent- 
specific LHF genes showed significant overrepresentation 
of protein binding (P = 9.7 x 10" 36 and P = 3.3 x 10" 16 ) 
compared with vertebrate HCNSs (e.g., Bejerano et al. 
2004; Siepel et al. 2005; Woolfe et al. 2005). We further 
examined the LHF genes in the protein-binding category 
and found that they were enriched in nervous system devel- 
opment (P= 2.93 x 10~ 37 ), positive regulation of transcrip- 
tion (P = 8.27 x 10~ 13 ), and transcription cofactor binding 
(P = 2.3 x 1 0~ 9 ), which are also important for developmen- 
tal process and transcriptional regulation. 

Selective Constraints on LHF Genes 

In order to identify evolutionary constraints on LHF genes, 
we examined the nonsynonymous (d/V) and synonymous 
(d5) substitution rates in LHF genes. First, d/V and d5 values 
for human-marmoset and mouse-rat pairs were obtained 
from Ensembl (Vilella et al. 2008). Using primate (or rodent) 
LHF genes that had annotated orthologs within 1 Mbp of 
the HCNS in human and marmoset (mouse and rat) ge- 
nomes, we calculated the means of d/V and d5 of genes 
in each pair. We then compared d/V and d5 of LHF genes 
with those of the genome average. LHF genes are expected 
to have important functions, and in fact, means of 6N/6S 
ratios in all pairs were significantly smaller than those of ge- 
nome averages (P < 0.001 in all pairs, one-sample f-test). 
However, d5 values of primate and rodent LHF genes were 
significantly smaller than those of genome averages as well 
as d/V values (P < 0.001 in human-marmoset and mouse- 
rodent pairs; table 2), indicating stronger constraint on 
flanking genes not only at amino acid sequence level but 
also at nucleotide level. 



The main advantage of studying HCNSs that are found in 
only one particular lineage is that we can compare evolu- 
tionary constraints on the LHF genes with those of orthologs 
that have no HCNS in another lineage. To investigate differ- 
ences in selective constraints on LHF genes and their ortho- 
logs that have no lineage-specific HCNS, we compared d/V 
and d5 levels of orthologs of primate (rodent) LHF genes 
with genome averages in rodent (primate) pair. We defined 
orthologs of primate LHF genes in rodents and those of ro- 
dent LHF genes in primates as primate LHF orthologs and 
rodent LHF orthologs, respectively (see fig. 6). As in the 
LHF genes, the primate and rodent LHF orthologs had sig- 
nificantly smaller 6N/6S ratios as well as d/V levels, when 
compared with those of genome averages in mouse-rat 
(human-marmoset) pair (P < 0.001, one sample f-test; 
table 2A and table 2B). 

On the other hand, there was no significant difference 
between d5 values of primate and rodent LHF orthologs 
and those of genome-wide genes (P > 0.05, table 2A 
and table 2B). This finding indicates that the evolutionary 
constraint on LHF genes is stronger than the constraint 
on genes that have no lineage-specific HCNSs. We also an- 
alyzed levels of constraints using d/V and d5 values in genes 
flanking noncoding UCEs which are an extreme case of 
HCNSs among vertebrates (Bejerano et al. 2004). All mean 
values (d/V/d5 ratio, d/V and d5 values) of UCE-flanking 
genes were significantly smaller than genome averages (ta- 
ble 2C). This result further supports the finding lower d5 is 
an important signature of the genes that are associated with 
HCNSs. 

Comparison of Lineage-Specific HCNS-Flanking Genes 
with Vertebrate HCNS-Flanking Genes 

GO analysis of the LHF genes showed that the most statis- 
tically overrepresented functions were developmental 
process and transcriptional regulation, which were quite 
similar to the overrepresented functions of HCNSs con- 
served among vertebrates (e.g., Bejerano et al. 2004; 
Woolfe et al. 2005). This observation raised the question 
whether the lineage-specific HCNSs are found near the 
same genes as those of vertebrate HCNSs whose origin 
are older than primate- and rodent-specific HCNSs. 

To address this question, we first examined the number of 
flanking genes that were shared among primate- and 
rodent-specific HCNSs and noncoding UCEs (fig. 7). The ma- 
jority of primate- and rodent-specific HCNSs (980 and 985, 
respectively) had LHF genes within 1 Mbp. They were often 
clustered near a small subset of genes, and the numbers of 
LHF genes for primate- and rodent-specific HCNSs were 820 
and 516, respectively. This is consistent with previous studies 
of vertebrate HCNSs including UCEs. Furthermore, a total of 
1 1 LHF genes were shared among lineage-specific HCNS- 
and UCE-flanking genes (fig. 7A). The number of genes 
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Table 1 

Top 30 Overrepresented Functions of LHF Genes 

GO Gene Function P Value 



(A) Primate LHF genes 



GO:0048856 


Anatomical structure development 


3.10 


X 


1Q -66 


GO:0048731 


System development 


2.10 


X 


1Q -64 


GO:0007275 


Multicellular organismal development 


2.32 


X 


10 58 


GO:0032502 


Developmental process 


3.18 


X 


10" 55 


GO:0007399 


Nervous system development 


6.11 


X 


10~ 54 


GO:0032501 


Multicellular organismal process 


9.41 


X 


10- 51 


GO:0005515 


Protein binding 


9.72 


X 


1Q -36 


GO:0009653 


Anatomical structure morphogenesis 


2.66 


X 


10- 32 


GO:0048869 


Cellular developmental process 


6.39 


X 


10~ 29 


GO:0030154 


Cell differentiation 


6.39 


X 


10 29 


GO:0048513 


Organ development 


8.93 


X 


10 29 


GO:0007154 


Cell communication 


5.11 


X 


10- 24 


GO:0050789 


Regulation of biological process 


2.08 


X 


10 23 


GO:0065007 


Biological regulation 


7.14 


X 


io- 23 


GO:0045941 


Positive regulation of transcription 


1.39 


X 


10 22 


GO:0045935 


Positive regulation of nucleobase, nucleoside, nucleotide, and nucleic acid metabolic process 


1.98 


X 


10- 21 


GO:0050794 


Regulation of cellular process 


5.97 


X 


10~ 21 


GO:0031325 


Positive regulation of cellular metabolic process 


1.49 


X 


1Q -20 


GO:0009893 


Positive regulation of metabolic process 


7.10 


X 


1Q -20 


GO:0009887 


Organ morphogenesis 


2.53 


X 


10- 19 


GO:0007165 


Signal transduction 


8.86 


X 


10~ 19 


GO:0000902 


Cell morphogenesis 


1.60 


X 


1Q -16 


GO:0032989 


Cellular structure morphogenesis 


1.60 


X 


1Q -16 


GO:0022008 


Neurogenesis 


6.43 


X 


1Q -16 


GO:0008134 


Transcription factor binding 


1.09 


X 


10 15 


GO:0048468 


Cell development 


1.49 


X 


10" 15 


GO:0003712 


Transcription cofactor activity 


1.57 


X 


10~ 15 


GO:0007267 


Cell-cell signaling 


6.28 


X 


10" 15 


GO:0048522 


Positive regulation of cellular process 


7.14 


X 


10" 15 


GO:0048699 


Generation of neurons 


1.09 


X 


10- 14 


) Rodent LHF genes 










GO:0005515 


Protein binding 


3.03 


X 


10~ 17 


GO:0007275 


Multicellular organismal development 


4.43 


X 


10" 15 


GO:0048731 


System development 


7.66 


X 


10~ 14 


GO:0032502 


Developmental process 


1.89 


X 


10 13 


GO:0048856 


Anatomical structure development 


1.92 


X 


10~ 13 


GO:0007399 


Nervous system development 


1.92 


X 


10- 13 


GO:0009653 


Anatomical structure morphogenesis 


3.25 


X 


10" 12 


GO:0050789 


Regulation of biological process 


4.09 


X 


10 12 


GO:0065007 


Biological regulation 


2.32 


X 


10~ 11 


GO:0032990 


Cell part morphogenesis 


1.00 


X 


10 -io 


GO:0030030 


Cell projection organization and biogenesis 


1.00 


X 


10 -io 


GO:0048858 


Cell projection morphogenesis 


1.00 


X 


10 -io 


GO:0009887 


Organ morphogenesis 


1.02 


X 


10 -09 


GO:0048666 


Neuron development 


1.02 


X 


10 -09 


GO:0050794 


Regulation of cellular process 


1.40 


X 


1Q -09 


GO:0031175 


Neurite development 


3.86 


X 


1Q -09 


GO:0007167 


Enzyme-linked receptor protein signaling pathway 


1.18 


X 


10 -08 


GO:0048869 


Cellular developmental process 


3.45 


X 


1Q -08 


GO:0030154 


Cell differentiation 


3.45 


X 


10 -08 


GO:0048513 


Organ development 


3.72 


X 


1Q -08 


GO:0030182 


Neuron differentiation 


4.58 


X 


1Q -08 


GO:0022610 


Biological adhesion 


4.74 


X 


10 -08 


GO:0007155 


Cell adhesion 


4.74 


X 


1Q -08 


GO:0000904 


Cellular morphogenesis during differentiation 


1.26 


X 


1Q -07 


GO:0007267 


Cell-cell signaling 


1.68 


X 


1Q -07 
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Table 1 
Continued 

GO Gene Function P Value 

GO:0005216 Ion channel activity 1.81 x 1CT 07 

GO:0016477 Cell migration 2.00 x 10 07 

GO:0010468 Regulation of gene expression 3.90 x 10~ 07 

GO:0022838 Substrate-specific channel activity 4.53 x 10 07 

GO:0045449 Regulation of transcription 4.99 x 10~ 07 



Note. — The P value was determined by Fisher's exact test and corrected with false discovery rate method. Only the gene functions belonging to "Biological process" and 
"Molecular function" of GO category are shown (Ashburner et al. 2000). A total of 980 and 985 LHF genes that locate within 1 Mbp of HCNSs were used for primate- and rodent- 
specific HCNSs, respectively. 



shared by primate-specific HCNSs and UCEs and rodent- 
specific HCNSs and UCEs were 31 and 41, respectively 
(fig. IB and C in the Venn diagram). Interestingly, we found 
that the numbers of genes shared by lineage-specific HCNSs 
and UCEs were significantly larger than random expectation 
(P < 1 0~ 4 ). Overrepresented functions of GO categories for 
the LHF genes overlapping with UCE-flanking genes were 
mainly involved in regulation of transcription, DNA binding, 
anatomical structure development, and intracellular 
membrane-bound organelle (fig. 7, scatter plots A through 
C). Note that rodent LHF genes overlapping with UCE- 
flanking genes were mildly enriched in anatomical structure 
development (P < 0.056). Many of these genes shared by 
lineage-specific HCNSs and UCEs are well studied and 
known to play an important role as transcriptional regula- 
tors during vertebrate development. 

Similarly, we examined LHF genes shared by primate- and 
rodent-specific HCNSs and found that the numbers of 
overlapping LHF genes were also significantly larger than 

Table 2 

d/V and d5 Values 



random expectation (P < 10~ 4 ) (fig. ID in the Venn dia- 
gram). In contrast, the proportion of primate and rodent 
LHF genes that were not shared by any gene set were 
82.4% and 74.6%, respectively (fig. IE and F in the Venn 
diagram). This finding demonstrates that the majority of the 
LHF genes are lineage specific. The main overrepresented 
gene functions were regulation of transcription, protein 
binding, anatomical structure development, and intracellu- 
lar membrane-bound organelle (fig. 7 scatter plots D 
through F). Taking into consideration that the UCE-flanking 
genes that do not include LHF genes were enriched in both 
DNA binding and protein binding, whereas genes involved 
in protein binding were less significant compared with those 
of DNA binding, the difference between the gene functions 
of data sets A through F was whether or not DNA binding 
was overrepresented (fig. 7, scatter plot G). 

Examples of these overlapping LHF genes are shown in 
figure 8. Lineage-specific HCNSs as well as UCEs were found 
in and around the PBX genes. A primate-specific HCNS and 



All genes 3 

(A) d/V and 65 of LHF genes in the human-marmoset pair 
Number of genes 15,011 
Average d/V 0.0407 (0.0494) 
Average 65 0.1684 (0.1140) 
Average 6N/65 h 0.2495 (0.2599) 

All genes c 

(B) 6N and 65 of LHF genes in the mouse-rat pair 
Number of genes 16,104 
Average d/V 0.0408 (0.0509) 
Average 65 0.2091 (0.0877) 
Average 6N/65 b 0.1910 (0.2360) 



(C) 6N and 65 of UCE-flanking genes 
Number of genes 
Average d/V 
Average 65 
Average dWdS h 



Primates 
141 

0.0245* (0.0378) 
0.1225** (0.0844) 
0.1857* (0.2381) 



Primate LHF genes 
462 

0.0250** (0.0362) 
0.1306** (0.0834) 
0.1852** (0.2123) 

Rodent LHF genes 
517 

0.0285** (0.0434) 
0.1943* (0.0948) 
0.1330** (0.1731) 

Rodents 
122 

0.0208* (0.0244) 
0.1705** (0.0748) 
0.1075** (0.1078) 



Orthologs of rodent LHF genes 
319 

0.0255** (0.0287) 
0.1740 (0.1279) 
0.1585** (0.1710) 

Orthologs of primate LHF genes 
306 

0.0262** (0.0352) 
0.2002 (0.0815) 
0.1241** (0.1248) 



Note. — Numbers in parentheses represent standard deviation. Values with asterisks represent significant differences (P < 0.001, one sample f-test) from the genome averages. 
** and * represent P < 10~ 9 and 10~ 3 , respectively. 
a All genes in the human genome. 
b 6N/65 ratio was calculated only for genes with 65 > 0. 
c All genes in the mouse genome. 



650 Genome Biol. Evol. 4(5):641-657. doi:10.1093/gbe/evs035 Advance Access publication April 13, 2012 



Lineage-Specific HCNSs in Mammals 



GBE 



Prim ate LHF gen e 




Rod ent LHF gen e 

-O H Gene B h 
HCNS 



Fig. 6. — Definition of LHF orthologs. The primate and rodent LHF 
orthologs are defined as the ortholog of primate LHF gene in rodents 
(Gene A in rodents), and the ortholog of rodent LHF gene in primates 
(Gene B in primates). Although primate and rodent LHF genes recruited 
lineage-specific HCNSs after the divergence of each lineage from the 
common ancestor, the majority of primate and rodent LHF orthologs did 
not. 

UCEs were located within PBX1 , and rodent-specific HCNSs 
and UCEs were located in and around PBX3 (fig. SA and B). 
PBX genes act as cofactors for various transcription factors 
such as HOX genes (e.g., Rauskolb and Wieschaus 1994; 
Mann 1995, Mann and Chan 1996, Mann and Affolter 
1998) and are involved in chromatin modification (e.g., 
Cirillo et al. 2002; Berkes et al. 2004). The lineage-specific 
HCNSs were also associated with SOX genes. Both primate- 
and rodent-specific HCNSs were found in and around 
SOX13 and SOX6 (Fig. 8Cand D), respectively. In addition, 
a primate-specific HCNS and a rodent-specific HCNS were 
also located near SOX9 and SOX1 1 , respectively. SOX genes 
are transcriptional activators that are required for normal 
development of the central nervous, chondrogenesis, and 
maintenance of cardiac and skeletal muscle cells (Wegner 
and Stolt 1995; Wegner 1999). 

MEF2C is an interesting example of primate and rodent 
HCNSs and UCE shared LHF genes. The genomic positions of 
the lineage-specific HCNSs and UCEs for MEF2C are shown 
in figure 8E. MEF2C belongs to the evolutionarily ancient 
MADS family of transcription factors which play central roles 
in the transmission of extracellular signals to the genome 
and in the activation of the genetic programs that control 
cell differentiation, proliferation, morphogenesis, survival, 
and apoptosis of a wide range of cell types (Shore and 
Sharrocks 1995; Potthoff and Olson 2007). TLE genes also 
recruited lineage-specific HCNSs. Another LHF gene that 
was shared among primate- and rodent-specific HCNSs 
and UCEs is TLE4 (fig. SF). This is a transcriptional compres- 
sor that binds to a number of transcription factors and 



inhibits the transcriptional activation mediated by PAX5 
and by CTNNB1 and TCF family members in Wnt signaling 
(Roose et al. 1998; Eberhard et al. 2000; Yaklichkin et al. 
2007). An interesting exceptional example of the LHF genes 
is NPAS3 (fig. 8G). This gene is a brain-enriched transcription 
factor belonging to the basic helix-loop-helix-PAS superfam- 
ily, the members of which carry out diverse functions, includ- 
ing circadian oscillations, neurogenesis, toxin metabolism, 
hypoxia, and tracheal development (Kamnasaran et al. 
2003), shown as HAR in figure 8G. NPAS3 has not only ver- 
tebrate HCNSs but also a human-accelerated region (Pollard 
et al. 2006). Three primate-specific HCNSs, one rodent- 
specific HCNS, and three UCEs are located in NPAS3. 
Another example of an LHF gene that is thought to have 
a critical impact on lineage-specific evolution is FOXP1. 
Primate-specific HCNSs, a rodent-specific HCNS, and three 
UCEs were found in and around FOXP1 (fig. SH). FOXP1 is 
a member of the FOX family of transcription factor and plays 
important roles in the regulation of tissue- and cell type- 
specific gene transcription (e.g., Kaufmann and Knochel 
1996; Carlsson and Mahlapuu 2002). 

Discussion 

In this study, we identified a total of 8,198 primate- and 
21,128 rodent-specific HCNSs and found that the lineage- 
specific HCNSs showed the signature of purifying selection 
at the SNP level as well as the nucleotide level (figs. 3 and 4). 
We found that the LHF genes as a whole were enriched in 
gene functions similar to those UCE-flanking genes (table 2). 
However, the majority of LHF genes and UCE-flanking genes 
are independent sets (fig. 7E-G in the Venn diagram). This 
suggests that a particular group of genes are preferentially 
associated with lineage-specific HCNSs instead of vertebrate 
HCNSs (fig. 7, scatter plots E through G). Thus, this group of 
genes may be regulated through HCNSs in a lineage-specific 
manner. Similarly, we found that there are UCE-flanking 
genes that have no lineage-specific HCNSs. These genes 
are thought to be a core set that may have contributed 
to the development of fundamental characteristics of verte- 
brates. On the other hand, we found that the number of LHF 
genes that were also UCE-flanking genes was significantly 
larger than random expectation (fig. IB and C in the Venn 
diagram). This suggests that certain groups of genes tend to 
recruit new HCNSs in addition to the vertebrate (old) HCNSs 
such as UCEs. The genes at the intersection of all lineages 
were the most extreme example (fig. 7A in the Venn dia- 
gram). These genes may have contributed to the evolution 
of different levels of organisms, for example, vertebrates, 
primates, and rodents. A particularly noteworthy feature 
of LHF genes is that even genes that are highly conserved 
among vertebrates, and which play an important role in 
the developmental process (e.g., FOXP1 and PBX genes), 
have lineage-specific HCNSs. This is so because regulatory 
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Fig. 7. — Comparison of genes among lineage-specific HCNSs and UCEs. Upper 7 panels show the scatter plots of the number of overrepresented 
gene functions and their P values obtained by GO analysis. The letters A through G in the scatter plots are corresponding to the letters in the Venn 
diagram, which shows the number of overlapping LHF genes among primate- and rodent-specific HCNSs and UCEs (numbers in parentheses). 



regions for the conservative genes are also conserved 
among a wide range of species. 

Although the primate- and rodent-specific HCNSs are 
found only in primates and rodents, respectively (fig. 1), 
there may be HCNSs that have been lost only in these lin- 
eages. Comparisons of ancient vertebrate conserved non- 
coding elements (aCNEs) which were present in the 
common ancestor of jawed vertebrates in Hox cluster loci 
showed that many of aCNEs have diverged beyond recog- 
nition in teleost fish (Lee et al. 2008). However, another 



study of HCNSs among four mammals showed that the loss 
rate of ultraconserved-like HCNSs in rodents was only 
0.086% (Mclean and Bejerano 2008). It is also known that 
there is a slowdown in substitution rates of UCEs in tetra- 
pods (Stephen et al. 2008). These observations indicate that 
the loss rate of mammalian and tetrapod HCNSs was smaller 
than expected. Therefore, if there are HCNSs that have been 
lost only in primate and rodent lineages, many of them are 
expected to be derived from aCNEs. The evolutionary pro- 
cess by which new regulatory networks are created may be 
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Fig. 8. — Examples of lineage-specific HCNS and UCE distributions. Purple, light blue, and yellow circles represent the position of UCE, primate-, 
and rodent-specific HCNSs, respectively. Examples of PBX1 {A), Pbx3 (B), SOX1 3 (O, Sox6 (D), MEF2C (E), TLE4 (F), NPAS3 (G), and FOXP1 (H) are shown 
in the figure. When LHF genes are of primate-specific HCNSs, the distribution of HCNSs and UCEs are always shown on the human genes. All genes but 
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driven by the addition and loss of HCNSs to genes that play 
an important role in development. 

It is also possible that new lineage-specific HCNSs and old 
vertebrate or mammalian HCNSs have the same function 
even if their sequences are not homologous. This suggests 
that the new-lineage specific HCNSs were created during 
functional turnover and that gaining the new HCNSs did 
not contribute to the lineage-specific evolution. In such 
cases, the gain and loss of HCNSs frequently occur within 
and around these vertebrate developmental genes and their 
gene expressions are maintained among a wide range of 
species. However, it is impossible to know whether or not 
nonhomologous long sequences have the same function 
by performing computational analysis alone. 

In TF binding sites, it is known that there is turnover, in- 
cluding gain and loss of DNA sequence motifs (Cliften et al. 
2003; Kellis et al. 2003; Gasch et al. 2004; Stark et al. 2007) 
as well as alterations in motif spacing relative to the start of 
transcription or to other motifs (Ihmels et al. 2005; Tanay 
et al. 2005). Recently, a number of genome-scale studies us- 
ing immunoprecipitation were performed to compare TF- 
binding patterns (Borneman et al. 2007; Tuch et al. 2008; 
Bradley et al. 2010; Lavoie et al. 2010; Schmidt et al. 
2010) or mRNA expression profiles across species (Ihmels 
et al. 2005; Tanay et al. 2005; Hogues et al. 2008; Field 
et al. 2009; Wapinski et al. 2010). Many of these studies 
have identified transcriptional programs that were dramat- 
ically rewired over short evolutionary time scales. For in- 
stance, Borneman et al. (2007) found that the TF Ted 
binds only 20% of the same target genes in comparisons 
between Saccharomyces cerevisiae and the closely related 
5. bayanus and 5. mikatae and that this difference is due 
to the gain and loss of canonical Ted c/s-regulatory motifs. 

Kim et al. (201 0) found abundant transcription at neuro- 
nal enhancers that are evolutionarily conserved. However, 
the lineage-specific HCNSs have only a few partial matches 
with ESTs and known RNA genes. This suggests that the ma- 
jority of HCNS functions are c/s-regulatory elements and in 
fact many studies reported that vertebrate HCNSs showed 
enhancer activities (e.g., Poulin et al. 2005; Woolfe et al. 
2005; Pennacchio et al. 2006; Lareau et al. 2007; Ni 
et al. 2007). In addition, we compared lineage-specific 
HCNSs and CNEs of other known vertebrate HCNSs (Woolfe 
et al. 2005, 2007) to determine whether there was any dif- 
ference in the results of comparisons with UCEs. The CNEs 
are over 1 ,400 HCNSs that were identified by human-fugu 
comparison and which include sequences overlapping with 
UCEs. As expected, we obtained similar results between 
noncoding UCEs and lineage-specific HCNSs in d/V and 
d5 and in shared genes analyses. 

We also found that there were differences between d5 
values of LHF genes and those of genome-wide genes (table 
2A and table 2B). It is known that the neutral mutation rate 
has regional biases in mammalian genomes (Mouse 



Genome sequencing Consortium 2002; Hardison et al. 
2003), and low d5 genes tend to have similar GO functional 
categories such as transcriptional regulation and develop- 
ment (Chuang and Li 2004). However, the low mutation 
rate of LHF genes does not affect substitution numbers in 
HCNSs because there is no correlation between d5 values 
and distances between HCNSs and their LHF genes (Pearson's 
r = 0.04 to 0.05). This raises the question as to why there is 
correlation between HCNSs and low d5 genes. 

One possible explanation is that a lower 65 gene may be 
constrained at nucleotide level when it affects splicing and/ 
or mRNA stability (Chamary et al. 2006). To investigate this, 
we examined the number of splicing variants in LHF genes. 
Both the primate and rodent LHF genes have significantly 
higher numbers of splicing variants than genome averages 
(one sample ttest, P < 2.50 x 1 0" 2 and P < 1 .87 x 1 0" 7 , 
respectively). However, LHF orthologs also showed relatively 
higher number of splicing variants (P < 3.0 x 10~ 2 and 
P < 2.8 x 10~ 2 , respectively). This does not explain the 
difference in d5 values between LHF genes and the LHF or- 
thologs. A second possibility is the change in chromatin 
structure. Genes that display a strong constraint at synon- 
ymous sites are preferentially located in closed regions of the 
genome because they require tight transcriptional regula- 
tion (Prendergast et al. 2007). Moreover, this strong con- 
straint on LHF genes at the nucleotide level suggests that 
many regulatory proteins may bind to the genes and interact 
with HCNSs for tight regulation of the gene expression. 

We carefully chose the species used in this study by con- 
sidering their coverage and quality. Nevertheless, there are 
some limitations in these analyses of lineage-specific HCNSs 
due to the small number of genomes of closely related spe- 
cies. We were able to use only two species genomes for the 
rodent lineage. We found small differences in genie catego- 
ries overlapping primate- and rodent-specific HCNSs (fig. 5A 
and B). The number of rodent-specific HCNSs overlapping 
intergenic and UTR was significantly larger than the compa- 
rable number in primates (P < 10~ 4 , Chi-squared test). We 
found another small difference in overrepresented gene 
functions. However, we cannot determine whether these 
differences are derived from lineage-specific characteristics, 
from the different number of species used for extraction of 
primate and rodent HCNSs or from annotation problems. 
Further studies of lineage-specific HCNSs are necessary to 
obtain clear pictures of lineage-specific evolution. 

In spite of the small differences, similar tendencies were 
found in the constraints on the primate and rodent LHF 
genes and their functions. Our analyses of LHF genes imply 
that the lineage-specific HCNSs were created in and around 
two categories of genes. In the first category, lineage- 
specific HCNSs are created near protein coding genes that 
had no HCNSs before. This expands the set of LHF genes 
that differ from those of ancestral (mammalian and verte- 
brate) HCNSs. These genes are more enriched in protein 
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binding, many of which are involved in nervous develop- 
ment, compared with ancestral HCNS flanking genes, sug- 
gesting that the lineage-specific evolution may be driven by 
changes in the regulation of protein interaction during 
nervous system development. In the second category, line- 
age-specific HCNSs are newly added to particular groups of 
genes that already have vertebrate HCNSs. One of the major 
gene groups codes transcriptional regulators involved in an- 
atomical development and may be involved in the various 
levels of lineage-specific evolution such as vertebrates, 
mammals, primates, and rodents. Many of the lineage- 
specific HCNSs and vertebrate HCNSs are likely to be 
associated with different gene sets. The lineage-specific 
evolution through HCNSs thus occurred by obtaining both 
new HCNS and LHF gene sets that differ from the "core" sets 
of vertebrate HCNS and its associated gene. 
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